
    Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

    Automatic detection and severity level classification of dysarthria directly from acoustic speech signals can be used as a tool in medical diagnosis. In this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor for building detection and severity level classification systems for dysarthric speech. The experiments were carried out with the widely used UA-Speech database. In the detection experiments, the best performance was obtained using embeddings from the first layer of the wav2vec model, which yielded an absolute improvement of 1.23% in accuracy compared to the best-performing baseline feature (spectrogram). In the severity level classification task, embeddings from the final layer gave an absolute improvement of 10.62% in accuracy compared to the best baseline feature (mel-frequency cepstral coefficients).
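Using a pre-trained encoder as a feature extractor, as described above, typically means pooling the frame-level hidden states of one chosen layer into a single utterance-level vector. The sketch below illustrates only that pooling step; model loading is omitted, and the hidden states are synthetic stand-ins with a layer count and dimensionality merely assumed to resemble a wav2vec 2.0 BASE encoder.

```python
import numpy as np

def utterance_embedding(hidden_states, layer):
    """Mean-pool the frame-level embeddings of one encoder layer
    into a single utterance-level feature vector.

    hidden_states: list of arrays, one per layer, each shaped (frames, dim)
    layer: index of the layer to use (e.g. 1 for the first transformer
           layer, -1 for the final layer)
    """
    return hidden_states[layer].mean(axis=0)

# Synthetic stand-in for per-layer encoder outputs: 13 layers
# (feature-encoder output + 12 transformer blocks), 50 frames, 768 dims.
rng = np.random.default_rng(0)
states = [rng.normal(size=(50, 768)) for _ in range(13)]

first_layer_feat = utterance_embedding(states, 1)   # detection, per the abstract
final_layer_feat = utterance_embedding(states, -1)  # severity classification
print(first_layer_feat.shape, final_layer_feat.shape)  # (768,) (768,)
```

The resulting fixed-length vectors can then be fed to any conventional classifier, which is what makes layer choice a tunable design decision rather than part of the model itself.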

    Äänihäiriöiden automaattinen tunnistus ja moniluokkainen luokittelu puhetallenteista (Automatic Detection and Multi-class Classification of Voice Disorders from Speech Recordings)

    Automatic detection of voice disorders from the speech signal has the potential to improve the reliability of medical diagnosis. Most earlier studies have focused on the binary detection of disorders without discriminating between different disorder types. In this thesis, a systematic examination of different speaking tasks, audio features, and classifiers was conducted in the contexts of binary detection and multi-class classification. The goal was to find the system that achieves the best classification performance and to study the complementary information between different speaking tasks and features. The examined speaking tasks were the sustained phonation of a vowel and the pronunciation of a sentence. The examined features included a set of cepstral coefficients and perturbation measures, and several commonly used classifiers were included. The primary multi-class classifier in this thesis was a hierarchical classifier, which has rarely been studied in this domain. The hierarchy is a sequence of increasingly detailed classifications based on a practical scenario: first, classification was performed between healthy and disordered speech, followed by classification between hyperfunctional dysphonia and vocal fold paresis. The results indicate that the proposed hierarchical system performs comparably to or better than the traditionally used multi-class systems, achieving multi-class classification accuracies of 59.00% and 62.31% for female and male speakers, respectively. The best accuracies in the first step of the hierarchy were 78.58% and 79.87% for female and male speakers, respectively. In the classification between the disorder types, the best accuracies were 66.20% and 73.11% for female and male speakers, respectively.
    In addition, this thesis reports several findings regarding the performance of different speaking tasks, features, and classifiers.
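The two-stage hierarchy described in the abstract can be sketched as a cascade of binary classifiers: the first stage separates healthy from disordered speech, and only disordered samples reach the second stage, which separates the two disorder types. The sketch below is purely illustrative, using a trivial nearest-centroid rule on synthetic scalar features rather than the thesis's actual features and classifiers; all names and data are hypothetical.

```python
# Hypothetical sketch of a two-stage hierarchical classifier:
# stage 1 = healthy vs disordered, stage 2 = hyperfunctional dysphonia
# vs vocal fold paresis. Each stage is a nearest-centroid rule.

def centroid(values):
    return sum(values) / len(values)

class HierarchicalClassifier:
    def fit(self, feats, labels):
        # labels: "healthy", "dysphonia", or "paresis"
        self.c_healthy = centroid([f for f, y in zip(feats, labels) if y == "healthy"])
        self.c_disordered = centroid([f for f, y in zip(feats, labels) if y != "healthy"])
        self.c_dysphonia = centroid([f for f, y in zip(feats, labels) if y == "dysphonia"])
        self.c_paresis = centroid([f for f, y in zip(feats, labels) if y == "paresis"])
        return self

    def predict(self, f):
        # Stage 1: healthy vs disordered.
        if abs(f - self.c_healthy) <= abs(f - self.c_disordered):
            return "healthy"
        # Stage 2: disorder type, reached only by disordered samples.
        if abs(f - self.c_dysphonia) <= abs(f - self.c_paresis):
            return "dysphonia"
        return "paresis"

# Synthetic scalar features: healthy near 0, dysphonia near 5, paresis near 9.
train = [(0.1, "healthy"), (0.3, "healthy"), (4.8, "dysphonia"),
         (5.2, "dysphonia"), (8.9, "paresis"), (9.1, "paresis")]
clf = HierarchicalClassifier().fit([f for f, _ in train], [y for _, y in train])
print(clf.predict(0.2), clf.predict(5.0), clf.predict(9.0))
# -> healthy dysphonia paresis
```

One design point the cascade makes explicit: each stage could in principle optimize its own feature set and classifier, which is the "task-wise" flexibility the hierarchical approach offers; this sketch reuses a single scalar feature at both stages only for brevity.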

    Hierarchical Multi-class Classification of Voice Disorders Using Self-supervised Models and Glottal Features

    Previous studies on the automatic classification of voice disorders have mostly investigated the binary classification task, which aims to distinguish pathological voice from healthy voice. Using multi-class classifiers, however, a more fine-grained identification of voice disorders can be achieved, which is more helpful for clinical practitioners. Unfortunately, there is little publicly available training data for many voice disorders, which lowers classification performance on data from unseen speakers. Earlier studies have shown that the use of glottal source features can reduce data redundancy in the detection of laryngeal voice disorders. Another approach to tackling the problems caused by the scarcity of training data is to utilize deep learning models, such as wav2vec 2.0 and HuBERT, that have been pre-trained on larger databases. Since the aforementioned approaches have not been thoroughly studied in the multi-class classification of voice disorders, they are jointly studied in the present work. In addition, we study a hierarchical classifier, which enables task-wise feature optimization and more efficient utilization of data. In this work, the three approaches are compared with traditional mel-frequency cepstral coefficient (MFCC) features and one-vs-rest and one-vs-one SVM classifiers. The results in a 3-class classification problem between healthy voice and two laryngeal disorders (hyperfunctional dysphonia and vocal fold paresis) indicate that all the studied methods outperform the baselines. The best performance was achieved by using features from wav2vec 2.0 LARGE together with hierarchical classification. The balanced classification accuracy of the system was 62.77% for male speakers and 55.36% for female speakers, outperforming the baseline systems by an absolute improvement of 15.76% and 6.95% for male and female speakers, respectively.
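The balanced accuracy reported in this abstract is the mean of per-class recalls, so each class contributes equally regardless of how many samples it has; with pathological speech data, where healthy recordings typically outnumber each disorder, this avoids rewarding a classifier that favors the majority class. A minimal stdlib-only sketch, with hypothetical labels:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class is weighted equally,
    regardless of its sample count."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced 3-class example: plain accuracy would be 10/12,
# but the dysphonia class is entirely missed, so balanced accuracy drops.
y_true = ["healthy"] * 8 + ["dysphonia"] * 2 + ["paresis"] * 2
y_pred = ["healthy"] * 8 + ["healthy"] * 2 + ["paresis"] * 2
print(balanced_accuracy(y_true, y_pred))  # recalls 1.0, 0.0, 1.0 -> ~0.667
```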